Nature Computational Science — Latest Matching Preprints

1

Personalized Feature Statistics: Individual-Level Variant Inference under Genetic Ancestry Continuum

Wang, J. F.; Yu, R.; Edelson, J.; Park, J.; Le Guen, Y.; Liu, X.; Belloy, M.; Ionita-Laza, I.; Greicius, M.; Tang, H.; He, Z.

2026-04-29 neurology 10.64898/2026.04.28.26351879 medRxiv

Top 0.1%

10.1%

Genome-wide association studies (GWAS) have successfully identified numerous genetic variants associated with complex diseases. However, the extent to which the effects of these variants vary across populations of diverse ancestries remains poorly understood. Furthermore, in these contexts genetic ancestry is treated as a categorical variable, thereby oversimplifying its continuous nature and the more nuanced ways in which it can influence genetic effects on disease. Here, we propose personalized feature statistics (PFstatistics), a statistical framework that quantifies the importance of genetic variants to a phenotype based on each individual's ancestry background, and profiles heterogeneous genetic effects across the genetic ancestry continuum. We demonstrate the utility of this framework through both simulations and real data analysis using sequencing data from ancestrally diverse cohorts in the Alzheimer's Disease Sequencing Project (ADSP). We show that Alzheimer's Disease (AD) risk variants span a spectrum from ancestry-homogeneous to ancestry-dependent effects, and that PFstatistics characterizes this spectrum at individual resolution across the ancestry continuum. PFstatistics also provides individual-level variant selection with FDR controlled at a target level, yielding distinct selection sets that vary across individuals according to their ancestry background. While demonstrated in the context of genetic ancestry, the proposed method is broadly applicable to other heterogeneity features such as environmental factors, offering a robust tool for understanding complex genetic contributions across diverse populations.

2

Scalable deep-learning-based inference of time-varying transmission dynamics from outbreak phylogenies

Xie, R.; Zhukova, A.; Pena, P. G.; Iglesias, G.; Hu, S.; Wang, J.; Tsang, T. K.; Dhanasekaran, V.; Kraemer, M. U. G.; Pybus, O. G.; Gascuel, O.

2026-05-10 infectious diseases 10.64898/2026.05.07.26352673 medRxiv

Top 0.1%

7.4%

Infectious disease dynamics can be inferred from pathogen genomic data using phylodynamic methods, but the applicability of many such approaches to large data sets is constrained by computational cost. Recent deep-learning approaches to phylodynamics have improved scalability, yet challenges remain when genetic divergence is limited during fast-spreading outbreaks. To address this, we use pathogen-specific models to show that deep-learning models trained on outbreak-like phylogenies can accurately estimate the reproductive number (R) when both the birth-death model and the expected phylogenetic resolution are matched to the target pathogen, highlighting the importance of realistic training conditions. Focusing on three major respiratory pathogens of public health importance (SARS-CoV-2, seasonal human influenza virus, and respiratory syncytial virus (RSV)), we introduce PhyloRt, a scalable framework for estimating the time-varying reproductive number (Rt) from large outbreak phylogenies. PhyloRt decomposes large trees into overlapping subtrees and applies a hierarchical deep-learning-based inference strategy to classify subtrees as exhibiting constant or time-varying reproduction numbers, enabling identifiable and computationally efficient estimation of Rt as a piecewise-constant trajectory through time. Applications to SARS-CoV-2 and influenza outbreaks show that PhyloRt recovers transmission dynamics consistent with estimates derived from mathematical epidemiological and Bayesian phylodynamic analyses. Our work enables scalable and rapid estimation of time-varying transmission dynamics from very large-scale outbreak genomic data sets, supporting real-time genomic epidemiology of emerging pathogens.
Significance: Estimating changes in transmission dynamics over time is important for responding to infectious disease outbreaks. Current methods mostly rely on reported case data from epidemiological surveillance, which can be biased or incomplete due to variable testing capabilities, particularly in resource-limited settings. A complementary approach is to use viral genomes as an alternative data source. However, inferences from genomic data can be computationally intensive and have mainly been applied retrospectively. We present PhyloRt, a scalable deep-learning-based phylodynamic framework that enables fast inference of the time-varying reproductive number (Rt) from large outbreak phylogenies. Our approach is widely applicable and provides a practical approach to monitoring epidemic dynamics, complementing traditional surveillance and supporting timely public health decision-making.

3

HiFiMAP: High-resolution fast identity-by-descent mapping test

Guo, B.; Naseri, A.; Xie, Z.; Sarnowski, C.; Zhi, D.; Chen, H.

2026-05-17 genetic and genomic medicine 10.64898/2026.05.06.26352570 medRxiv

Top 0.1%

6.5%

Although traditional genome-wide association studies (GWAS) have identified numerous loci, they often ignore phased haplotype information. Identity-by-descent (IBD) mapping captures these extended haplotypic effects by modeling shared ancestral segments. However, standard statistical mapping of these segments scales poorly with biobank-sized cohorts and short IBD segments that capture older evolutionary events. To overcome this computational bottleneck, existing scalable IBD mapping frameworks aggregate shared segments into fixed sliding windows. While computationally efficient, this window-based approach generates association signals at a low resolution that often span hundreds of kilobases. To address this issue, here we present a novel High-resolution Fast IBD Mapping test (HiFiMAP) that takes snapshots of IBD segments at the single nucleotide polymorphism (SNP) level resolution. Simulation studies confirm that HiFiMAP maintains well-controlled type I error rates and exhibits superior statistical power for detecting rare variants and haplotype effects using short IBD segments. In a UK Biobank (UKB) benchmark (N=407,681), HiFiMAP mapped 640,899 SNPs at 1.92 CPU seconds per test, massively outperforming existing window-based methods (95.2 CPU seconds per test for 3,403 windows). Furthermore, applied to high-dimensional brain imaging phenotypes (N~36,000), HiFiMAP identified five novel associations previously undetected by standard GWAS approaches, including key central nervous system regulators like NR2F1 and NSF/WNT3. By refining large testing windows into highly specific genomic variants, HiFiMAP empowers biobank-scale, SNP-level resolution mapping to accurately pinpoint complex trait architectures.
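
The benchmark figures quoted in the HiFiMAP abstract can be put side by side with a little arithmetic. The sketch below uses only the numbers stated above; the constant and function names are illustrative, and the aggregate totals assume that every test costs the quoted average, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract (UK Biobank benchmark, N=407,681).
HIFIMAP_TESTS = 640_899        # SNP-level tests
HIFIMAP_SEC_PER_TEST = 1.92    # CPU seconds per test
WINDOW_TESTS = 3_403           # window-based tests
WINDOW_SEC_PER_TEST = 95.2     # CPU seconds per test


def per_test_speedup() -> float:
    """Ratio of per-test CPU cost: window-based method over HiFiMAP."""
    return WINDOW_SEC_PER_TEST / HIFIMAP_SEC_PER_TEST


def total_cpu_hours(n_tests: int, sec_per_test: float) -> float:
    """Aggregate CPU hours, assuming every test costs the quoted average."""
    return n_tests * sec_per_test / 3600.0


if __name__ == "__main__":
    print(f"per-test speedup: {per_test_speedup():.1f}x")
    print(f"HiFiMAP total: {total_cpu_hours(HIFIMAP_TESTS, HIFIMAP_SEC_PER_TEST):.0f} CPU h")
    print(f"window total:  {total_cpu_hours(WINDOW_TESTS, WINDOW_SEC_PER_TEST):.0f} CPU h")
```

Under that assumption the per-test cost is roughly 50 times lower for HiFiMAP, while its aggregate cost is larger because it runs about 188 times as many tests at much finer resolution.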

4

Individualized Functional Deviation Mapping: Linking Heterogeneous Structural Atrophy to Convergent Network Disruption in Preclinical Alzheimer's Disease

Tellaetxe-Elorriaga, I.; Jimenez-Marin, A.; Diez, I.; Erramuzpe, A.; Cortes, J. M.

2026-05-13 radiology and imaging 10.64898/2026.05.11.26352893 medRxiv

Top 0.1%

5.5%

The preclinical phase of Alzheimer's disease (AD) is characterized by profound biological and structural heterogeneity, challenging our ability to map early pathology onto large-scale brain networks. To address this fundamental challenge, we introduce Functional Deviation Maps (πz), an individualized neuroimaging framework for mapping participant-specific functional architecture to their unique structural atrophy landscape. By fitting a normative model to the voxel-based morphometry of amyloid-negative individuals, we extract personalized "atrophy seeds" (W-scores ≤ -1.96) for amyloid-positive patients, subsequently obtaining their resting-state seed-based connectivity (SBC). By standardizing these participant-level SBC maps against a healthy reference distribution, we show that, despite the highly variable spatial origins of structural atrophy, individual functional deviations converge into a common "atrophy network". Spatial enrichment analyses show that the functional disruption is not random, but is preferentially dominated by the Default Mode Network. Furthermore, by projecting these populational functional deviations onto high-order cognitive topographies, we find a considerable alignment with the brain's fundamental unimodal-transmodal and external-internal attentional gradients. Overall, the πz framework transcends conventional group-level averages, offering a highly personalized, biologically meaningful signature of system-level network vulnerability in the earliest stages of AD.

5

DAMPA - accelerated and simplified design of probe panels for targeted metagenomics using pangenome graphs

Payne, M.; Tam, K. K.-G.; Rockett, R. J.; Basile, K.; Bowden, R.; Sintchenko, V.; Kok, J.; Golubchik, T.

2026-05-22 infectious diseases 10.64898/2026.05.15.26352859 medRxiv

Top 0.1%

5.2%

Targeted metagenomics, where samples are enriched for multiple organisms of interest using oligonucleotide probes, is a highly efficient sequencing methodology that is becoming standard practice for genomics of viruses and complex polymicrobial samples. Efficient enrichment critically requires probes that capture both conserved and highly diverse genomic regions without loss of sensitivity, and with uniform representation in the sequencing pool. Design of optimal probesets poses a challenge: existing computational methods use k-mer hashing to reduce over-abundant sequences, but scalability and efficiency drop with increasing numbers of genomes, while diverse sequences remain under-represented. Here we show that incorporating evolutionary distance to compress probes via a graph-based representation of multiple genomes across species, together with k-mer hashing, reduces overrepresentation of conserved sequences, and yields more uniform coverage even of highly diverse loci. We make the method available in Dampa, an open-source tool that generates probesets in seconds on a standard laptop.

6

DigiMus: a connectome-informed spiking framework for multi-region mouse neural-behavior modeling

Liu, Y.; Zhang, X.; Chen, X.; Hao, C.; Yao, W.; Zhang, J.; Sun, Y.; Zhang, T.

2026-06-11 neuroscience 10.64898/2026.06.09.731075 medRxiv

Top 0.1%

4.3%

Computational models are increasingly used to relate mouse brain structure, neural activity and behavior, but most models still learn from task data with limited constraints from biological circuit organization. Here we present DigiMus, a connectome-informed spiking framework for multi-region-capable mouse neural-behavior modeling. DigiMus combines leaky integrate-and-fire spiking dynamics with brain-region-specific motif regularization in a trainable sequence-modeling architecture, allowing directed three-node circuit motifs derived from 38,481 reconstructed neuronal morphologies across approximately 50 brain regions to guide recurrent coupling during learning. We evaluate DigiMus on 18 rule-based cognitive tasks spanning sensorimotor mapping and perceptual decision-making, and on three mouse neural decoding datasets involving auditory discrimination, fixed-interval licking and visual decoding. Across synthetic tasks, DigiMus showed stable performance relative to TCN, LSTM and Transformer baselines, with stronger advantages in more complex decision-making settings. In real neural datasets, single-region instantiations of DigiMus produced small, consistent and dataset-dependent improvements over a structure-free sequence baseline, while retaining motif-prior signatures in trained connectivity. Internal state analyses further linked task-dependent state dynamics to behavioral error patterns. These results suggest that connectome-derived structural priors can shape neural sequence models, and establish DigiMus as a modular, connectome-informed workflow for mouse neural-behavior modeling and hypothesis generation, rather than a complete digital reconstruction.
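
The leaky integrate-and-fire dynamics mentioned in the DigiMus abstract can be illustrated with a single neuron. This is a minimal textbook sketch, not the DigiMus implementation: the function name, the Euler discretization, and all parameter values (time constant, threshold, reset) are assumptions chosen for illustration.

```python
def simulate_lif(input_current, dt=1.0, tau=10.0, v_rest=0.0,
                 v_threshold=1.0, v_reset=0.0):
    """Euler-integrate one leaky integrate-and-fire neuron.

    Returns (voltages, spike_steps). All parameters are hypothetical and
    in arbitrary units; they are not taken from the DigiMus preprint.
    """
    v = v_rest
    voltages, spike_steps = [], []
    for step, current in enumerate(input_current):
        # Leak toward rest plus input drive: dv/dt = (-(v - v_rest) + I) / tau
        v += dt * (-(v - v_rest) + current) / tau
        if v >= v_threshold:
            # Threshold crossing: record a spike and reset the membrane.
            spike_steps.append(step)
            v = v_reset
        voltages.append(v)
    return voltages, spike_steps


if __name__ == "__main__":
    _, spikes = simulate_lif([2.0] * 50)
    print("spike steps for constant drive:", spikes)
```

A constant input above threshold yields regular spiking, while an input whose steady-state voltage stays below threshold yields none. A trainable spiking network would replace the hard threshold with a surrogate gradient, which this sketch omits.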

7

PRISM: Peptide-specificity annotation of T-cell receptors with uncertainty quantification

Venkatraman, D. L.; Mok, L.; Rose, N. R.; Robinson, A.; Jonsson, V. D.

2026-07-04 immunology 10.64898/2026.06.30.735715 medRxiv

Top 0.1%

4.3%

Mapping T-cell receptor (TCR) sequences to their cognate peptide-major histocompatibility complex (pMHC) ligands underlies both basic immunology and T-cell target discovery, yet current models aimed at predicting TCR specificity are limited by sparse labels, viral-biased training data, and an inability to recognize receptors outside their training distribution. We present PRISM, an uncertainty-aware metric-learning framework for TCRβ sequence representation. PRISM embeds receptors into a peptide-organized latent space, returns top-k peptides by nearest-neighbor retrieval, and abstains on out-of-distribution receptors by modeling an intrinsic uncertainty that tracks annotation correctness. To offset the viral bias of public databases, PRISM augments training data with structure-guided synthetic receptors that diversify TCR sequences while preserving the energetics of the TCR-pMHC interface. Across a held-out set of 923 peptides and the independent IMMREP23 benchmark, PRISM matches or exceeds sequence-based models, with largest gains on rare epitopes. Finally, PRISM learns attention weights on TCR residues that concentrate on the CDR3β salt-bridge and hydrophobic contacts central to peptide recognition, linking PRISM's positional focus to the biochemical properties of TCR-pMHC structures.
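
The retrieval step described in the PRISM abstract (top-k peptides by nearest neighbor, with abstention on out-of-distribution receptors) can be sketched on toy embeddings. The sketch is not the authors' method: the function name, the Euclidean metric, and the distance threshold are assumptions, and the threshold is only a crude stand-in for the learned uncertainty the abstract describes.

```python
import math


def top_k_peptides(query, reference, k=3, max_distance=1.0):
    """Return up to k distinct peptide labels nearest to `query`, or None.

    reference: list of (embedding, peptide_label) pairs (toy data).
    Returning None means "abstain": the closest reference embedding is
    farther than max_distance. This threshold is a hypothetical
    replacement for a learned uncertainty estimate.
    """
    ranked = sorted(reference, key=lambda item: math.dist(query, item[0]))
    if not ranked or math.dist(query, ranked[0][0]) > max_distance:
        return None
    labels = []
    for _, peptide in ranked:
        if peptide not in labels:       # keep each peptide once, best rank first
            labels.append(peptide)
        if len(labels) == k:
            break
    return labels


if __name__ == "__main__":
    toy_reference = [
        ((0.0, 0.0), "peptide_A"),
        ((0.1, 0.0), "peptide_A"),
        ((1.0, 1.0), "peptide_B"),
        ((5.0, 5.0), "peptide_C"),
    ]
    print(top_k_peptides((0.05, 0.0), toy_reference, k=2))
    print(top_k_peptides((20.0, 20.0), toy_reference, k=2))
```

The first query sits inside the reference cloud and gets ranked labels; the second is far from every reference point, so the function abstains rather than forcing an annotation.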

8

geneXplore: An Interactive Browser for X Chromosome-Wide Association Study Results

Cook, N.; Boulais-Richard, J.; Zeng, Y.; Yang, C.; Budde, J.; Taliun, D.; Gagliano Taliun, S. A.; Cruchaga, C.; Belloy, M. E.

2026-07-14 neurology 10.64898/2026.07.14.26357489 medRxiv

Top 0.1%

4.1%

Summary: The X chromosome comprises approximately 5% of the human genome and encodes over 800 protein-coding genes, many of which exhibit sex-differentiated expression patterns due to escape from X chromosome inactivation (XCI) mechanisms. Despite its relevance to sex differences in complex traits, the X chromosome is routinely excluded from genome-wide association studies due to analytical challenges, and when it is analyzed, the impact of escape from XCI or of sex is explored only to a limited extent. No dedicated, publicly accessible browser for X chromosome-wide association study (XWAS) summary statistics currently exists, creating a barrier to systematic investigation of X-linked contributions to human traits. Here, we present geneXplore, an interactive web browser based on the PheWeb2 implementation, tailored for XWAS summary statistics across 1,944 phenotypes while distinguishing random XCI (rXCI), escape from XCI (eXCI), and sex-stratified analyses. Users can explore results via interactive plots (Manhattan and Miami, PheWAS and LocusZoom), searchable tables and access to cross-database lookup, with full summary statistics available for download. Availability and Implementation: geneXplore is freely available at https://genexplore.wustl.edu/ with no registration required and will be maintained for a minimum of two years following publication. Source code is available at https://github.com/Belloy-Lab/geneXplore_XWAS_Browser under an MIT license.

9

Unbiased identification of responding T cell clones from longitudinal repertoire sequencing with CloneSearch

Milighetti, M.; Sethna, Z.; Martis, S.; Reiche, C.; Elhanati, Y.; Balachandran, V. P.; Greenbaum, B. D.; Walczak, A. M.; Mora, T.

2026-06-01 immunology 10.64898/2026.05.29.728700 medRxiv

Top 0.1%

4.0%

T cells activate and expand upon interaction with cognate antigen, derived from pathogens or mutated proteins. T cell clones can be identified by their T cell receptor (TCR), which can act as a unique barcode to track their expansion. Longitudinal TCR sequencing can be used to track T cell responses to a large array of stimuli. However, experimental identification of T cell clones of interest is challenging, especially when information about the driving antigen is lacking. Computational identification based on clonal dynamics is an antigen-agnostic alternative. However, it is subject to sequencing noise and biological variability, and relies on the choice of particular time points that are compared to find expanding and contracting clones. We present CloneSearch, a method to identify expanding and contracting T cell clones from longitudinal TCR sequencing which is agnostic to the time of stimulus and can account for the noise these clones are subject to. We show that CloneSearch can recapitulate previously identified responses from published data, and expand the analysis to show identification of previously undetected responses from these same datasets. We make CloneSearch available at https://github.com/mm523/CloneSearch.

10

Steering Sequence Generation in Protein Language Models through Iterative Lookback Monte Carlo Sampling

Calvanese, F.; Lombardi, G.; Weigt, M.; Fernandez-de-Cossio-Diaz, J.

2026-05-07 bioinformatics 10.64898/2026.05.01.722156 medRxiv

Top 0.1%

3.9%

Protein language models (pLMs) leverage large-scale evolutionary data to generate novel sequences, but steering generation toward desired physicochemical properties without sacrificing diversity remains a major challenge. Existing approaches often induce severe diversity loss or require computationally expensive retraining. We introduce Iterative Lookback Monte Carlo (ILMC), a training-free inference-time sampling strategy that interleaves autoregressive elongation with Metropolis-Hastings refinement to approximate sampling from a maximum-entropy target distribution balancing generative quality and steering objectives. We show theoretically that this target distribution is entropy-maximizing under fixed generative quality and steering constraints, and empirically that ILMC produces more diverse samples than standard autoregressive baselines at matched generative quality. Using simple steering potentials, ILMC improves desired molecular properties, including generating proteins with up to 12 °C higher predicted melting temperature than compute-matched alternative strategies. ILMC naturally applies to classifier-guided steering, where it outperforms purely autoregressive guidance in diversity while maintaining comparable enrichment of target properties. We validate ILMC on family-specific pLMs and on the multi-family model ProGen3.
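
The Metropolis-Hastings refinement that ILMC interleaves with elongation can be illustrated in its generic form. The sketch below is a plain single-site Metropolis-Hastings sampler on a toy four-letter alphabet, not the ILMC algorithm itself: the alphabet, the `log_target` function (a flat model term plus a steering potential that counts the letter "A"), and the `beta` strength are all assumptions made for illustration.

```python
import math
import random

ALPHABET = "ACDE"  # toy alphabet, not the 20 amino acids


def log_target(seq, beta=2.0):
    """Hypothetical unnormalized log-density: flat model term + steering.

    A real setup would add a language-model log-probability here; this toy
    keeps only a steering potential rewarding the letter 'A'.
    """
    return beta * seq.count("A")


def mh_refine(seq, n_steps, rng, log_density=log_target):
    """Run n_steps of single-site Metropolis-Hastings on a sequence."""
    seq = list(seq)
    current = log_density("".join(seq))
    for _ in range(n_steps):
        pos = rng.randrange(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice(ALPHABET)   # symmetric proposal
        proposed = log_density("".join(seq))
        # Accept with probability min(1, exp(proposed - current)).
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed
        else:
            seq[pos] = old                # reject: restore previous residue
    return "".join(seq)


if __name__ == "__main__":
    print(mh_refine("C" * 20, 2000, random.Random(0)))
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of target densities. Raising `beta` concentrates samples on high-scoring sequences at the cost of diversity, which is the trade-off the abstract describes.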

11

DPLM: Dynamics-aware Protein Language Model via contrastive learning between sequence and molecular dynamics simulation trajectory

Jiang, Y.; Wang, D.; Imam, I. A.; Xu, D.; Shao, Q.

2026-05-04 bioinformatics 10.64898/2026.04.29.721692 medRxiv

Top 0.1%

3.4%

Protein dynamics play a critical role in protein function, yet such important information is missing in many protein language models (PLMs). We introduce DPLM, a dynamics-aware protein language model that aligns sequence embeddings with molecular dynamics (MD) trajectory embeddings via contrastive learning. Using MD features encoded by a pretrained video model, DPLM learns sequence representations that correlate with residue-level flexibility and improve protein-level functional clustering compared to static sequence- and structure-based PLMs. Without task-specific training, DPLM outperforms ESM-based representations in zero-shot mutation-effect prediction on multiple deep mutational scanning datasets. When adapted with lightweight task-specific heads, DPLM further achieves top-tier performance on protein stability prediction and intrinsic disorder region identification, demonstrating that contrastive alignment with MD trajectories enables PLMs to capture biologically meaningful dynamic properties.

12

FastDedup - A fast and memory-efficient tool for read deduplication

Ribes, R.; Mandier, C.; Baniel, A.

2026-05-04 bioinformatics 10.64898/2026.04.29.721745 medRxiv

Top 0.1%

3.3%

PCR duplicate removal is a critical first step in high-throughput sequencing pipelines, yet existing tools struggle with speed, memory, or correctness at modern dataset scales. We present FastDedup, a Rust-based FASTX deduplicator that transforms each read or read pair to a compact xxh3 hash fingerprint, drastically reducing memory usage and binding most of the execution time to disk I/O. Benchmarked against six competing tools on synthetic human WGS datasets up to 300 million reads, FastDedup consistently leads on paired-end data, running more than 10 times faster than fastp. It also outperforms all tools on uncompressed single-end data, deduplicating a million reads in a second. We additionally report correctness failures in prinseq++ and clumpify. FastDedup is available under the MIT License via GitHub, Bioconda, and Cargo.
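
The fingerprint idea in the FastDedup abstract (store a compact hash per read instead of the read itself) can be sketched in a few lines. This is an illustration of the general technique, not FastDedup's code: it is written in Python rather than Rust, uses an 8-byte BLAKE2b digest from the standard library as a stand-in for xxh3, and the function names and exact-match duplicate definition are assumptions.

```python
import hashlib


def fingerprint(sequence: str) -> bytes:
    """8-byte digest of a read sequence.

    BLAKE2b is a stdlib stand-in here; the abstract names xxh3 instead.
    """
    return hashlib.blake2b(sequence.encode("ascii"), digest_size=8).digest()


def deduplicate(reads):
    """Yield (name, sequence) for the first occurrence of each sequence.

    Only fixed-size fingerprints are kept in memory, so memory grows with
    the number of distinct reads rather than with total read length.
    """
    seen = set()
    for name, sequence in reads:
        fp = fingerprint(sequence)
        if fp not in seen:
            seen.add(fp)
            yield name, sequence


if __name__ == "__main__":
    toy_reads = [
        ("read1", "ACGTACGT"),
        ("read2", "TTTTGGGG"),
        ("read3", "ACGTACGT"),   # exact duplicate of read1
    ]
    for name, seq in deduplicate(toy_reads):
        print(name, seq)
```

Comparing fingerprints rather than full sequences means two different reads could in principle collide and one would be dropped; with a 64-bit digest that risk is small but not zero, a trade-off this sketch accepts without further checks.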

13

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

Liu, D.; Yu, Y.; Wu, Y. N.

2026-05-15 neuroscience 10.64898/2026.05.10.724161 medRxiv

Top 0.2%

3.3%

The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black-box heuristics or gradient-free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce Thoughts-as-Planning, a novel framework that formalizes reasoning chain optimization as a sequential decision-making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity-preserving embedding space is constructed to encode reasoning chain-response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi-scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts-as-Planning outperforms state-of-the-art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at https://github.com/FastLM/Thoughts-as-Planning.

14

Geometric Theoretical Framework for Dynamic Protein Mutation Detection Models: Defect Awareness and Pathogenicity Prediction

Shao, H.

2026-04-26 bioinformatics 10.64898/2026.04.22.720255 medRxiv

Top 0.2%

3.2%

Traditional protein mutation detection and pathogenicity prediction pipelines rely on static single-conformation structural modeling, inherently ignoring conformational flexibility, dynamic ensemble evolution, and the underlying manifold geometry of protein dynamics. This induces systematic detection failures in flexible regions, allosteric sites, and metastable functional domains, yet lacks a rigorous mathematical characterization of such failure mechanisms. In this work, we establish a theorem-driven geometric-algebraic framework for dynamic protein mutation modeling. Starting from a dynamic conformational Riemannian manifold, we construct the latent representation space via representation-induced completion of operator-valued observations, rather than pre-assumed embedding structures. Within this setting, algebraic constraints are not imposed axiomatically but relaxed into learnable approximate Lie algebra regularization, enabling statistical verification of structural consistency. By integrating the Levi-Civita connection, geodesic deviation, and heat kernel asymptotics, we introduce a Lipschitz-stable topological spectral defect (TSD, δ_spec) index that quantifies the intrinsic inconsistency between static representations and dynamic geometric invariants, linking it to curvature-induced instability and Lie algebra deformation. Under a functorial compatibility principle, we design a dual-branch architecture for pathogenicity prediction and defect awareness, realized via local Lie algebra encoding and low-rank spectral approximation. On multi-source datasets (108 curated PDB structures, 1060 validated residues from ClinVar, DMS, MaveDB, and gnomAD), we establish three fundamental theorems and validate key findings: TSD effectively distinguishes pathogenic/functional variants (PTM: μ = 0.386, OMIM: μ = 0.443, ClinVar: μ = 0.302) from neutral ones (gnomAD: μ = -0.660) with high significance (P = 6.67 × 10^-18) and strong classification performance (AUC = 0.82-0.86), while correlating strongly with protein stability (ΔΔG, Spearman = 0.9794, P = 5.38 × 10^-28). TSD further reveals PTM sites as topological hubs and neutral variants as evolutionary topological redundancy, enabling a paradigm shift from sequence alignment to geometric dynamics and providing a physics-based biomarker for variants of uncertain significance (VUS). These results upgrade protein mutation modeling from empirical static prediction to provable dynamic mechanism analysis. The source code of this work is publicly available at https://github.com/Harmenlv/LieFold-AI/tree/main.

15

Explicit representation of germline and non-germline residues improves antibody language modeling

Kim, J.; Blalock, N.; Kulkarni, A.; Nakamura, K.; Romero, P. A.

2026-05-11 immunology 10.64898/2026.05.06.723387 medRxiv

Top 0.2%

3.2%

Antibodies originate from germline templates and are diversified by somatic hypermutation, producing sequences in which conserved germline residues scaffold structure while rare non-germline (NGL) substitutions refine antigen binding. Current antibody language models (ALMs) treat all residues equivalently and inherit a germline bias that systematically down-weights functionally critical NGL mutations as statistical noise. We introduce PRISM, a germline-aware ALM that explicitly represents germline and non-germline residues as distinct token types over a factorized 53-token vocabulary. PRISM achieves state-of-the-art pseudo-perplexity in hypervariable CDRs and is uniquely positively correlated with experimental binding affinity across three deep mutational scanning landscapes on which all compared ALMs anti-correlate. The dual vocabulary further enables property-specific controllable generation previously unattainable with entangled ALMs. NGL-directed sampling improves physics-based binding scores while GL-directed sampling preserves stability and solubility. These results establish disentangled germline/non-germline representation as a substantive advance in antibody language modeling.

16

STELLAR: A flexible ensemble learning framework integrating rare variants to enhance polygenic risk prediction

Chen, T.; Li, X.; Mazumder, R.; Zhang, H.; Lin, X.

2026-06-09 genetic and genomic medicine 10.64898/2026.06.07.26355109 medRxiv

Top 0.2%

3.2%

Whole-exome and whole-genome sequencing technology has enabled the discovery of rare genetic variants associated with human health and diseases. However, existing statistical methods used for rare variant association testing are not well-suited for building genetic risk prediction models that jointly incorporate rare and common variants. We propose STELLAR, a flexible ensemble learning-based approach to compute rare variant polygenic risk scores (PRS) using association summary statistics to enhance conventional common variant PRS. Our method combines burden-based and penalty-based rare variant analysis and leverages functional annotation information to prioritize potentially causal variants within the prediction models. In simulation studies, PRS using STELLAR consistently showed the highest prediction accuracy compared to models using common variants alone or rare variant burdens. Applied to UK Biobank whole-exome sequencing data (n=310,831) across eight continuous and five binary traits, STELLAR significantly improved prediction accuracy, refined stratification of individuals at the highest genetic risk beyond common variants, and prioritized biologically relevant genes. STELLAR provides a scalable strategy to incorporate rare variants into PRS in addition to common variants, advancing precision risk prediction and enabling more comprehensive assessment of genetic contributions to complex diseases.

17

Bayesian Nonparametrics for Normative Modelling in Multiple Sclerosis via Modularised Inference

Taschler, B.; Nichols, T. E.; Ganjgahi, H.

2026-05-15 radiology and imaging 10.64898/2026.05.10.26352835 medRxiv

Top 0.2%

3.2%

Normative models produce per-subject deviation scores that feed directly into downstream analyses, but typical pipelines (i) treat confounders with ad-hoc or purely linear adjustments, and (ii) pass point estimates of deviation scores directly to the downstream model, ignoring uncertainty. We propose an integrated, two-module Bayesian framework that aims to address both limitations. A normative module based on Bayesian Additive Regression Trees (BART) flexibly captures non-linear effects and higher-order interactions while marginalising over image-quality variables via counterfactual averaging. Crucially, we define individual deviation as d_i = E[Y | X_i, Z_i] - μ(Z_i), with μ(Z) the feature-conditional population mean, not as a residual. A SoftBART survival model then ingests the full posterior distribution of deviation scores via a cut-posterior construction, propagating upstream uncertainty while blocking feedback from the outcome model. Across challenging simulations and a large clinical data set of multiple sclerosis patients (N>8k), the integrated approach yields better calibration, prediction accuracy and time-varying hazard separation between groups than a two-step plug-in Cox regression model. Modularised inference with BART-based normative deviations improves both flexibility and uncertainty quantification, and extends naturally to other outcomes beyond survival.

18

CuGen: A GPU-accelerated framework for large-scale genomics

Kiiskinen, T.; Richland, J.; Wang, W.; Lu, W. S.; Balasubramanian, N.; Hastie, T.; Tibshirani, R.; Rivas, M. A.

2026-07-17 genetic and genomic medicine 10.64898/2026.07.15.26358178 medRxiv

Top 0.2%

3.2%

Biobank-scale genomic analyses remain computationally expensive, CPU-bound workflows, particularly when adjusting for confounding. Here, we present CuGen, a GPU-accelerated framework for large-scale genomics. CuGen uses UltraLasso, a novel hierarchical application of univariate-guided sparse regression (uniLasso), to select a compact, phenotype-informed active set of fewer than 30,000 variants. This achieves robust leave-one-chromosome-out (LOCO) confounding control, enabling both downstream GWAS and in-sample fine-mapping. Additionally, we introduce the .cugen file format, a genotype representation designed for memory-optimized, high-throughput streaming and random access on GPU hardware. Building on this substrate, we provide a general GPU-accelerated genomics toolkit handling polygenic prediction, data manipulation, quality control, analysis, and visualization. We demonstrate CuGen's efficacy in the UK Biobank with up to 408,624 individuals, where the full GWAS pipeline and fine-mapping against 6.8 million imputed variants completes in approximately 10 minutes on a single high-throughput GPU with 80 GB of memory. The pipeline scales efficiently to massive phenome-wide analyses with sublinear resource consumption.

19

Revisiting CPUs for Protein Folding: Xeon-Based Acceleration of AlphaFold2

Chaudhary, N.; Yang, W.; Kalamkar, D.; Zhou, J.; Ghosh, S.; Xia, L.; Tiwari, M.; Heinecke, A.; Kaul, B.; Misra, S.

2026-05-29 genomics 10.64898/2026.05.27.728222 medRxiv

Top 0.2%

3.2%

Protein structure prediction via AlphaFold2 has revolutionized drug discovery, yet its end-to-end execution remains computationally intensive. While GPUs are traditionally favored for deep learning, the AlphaFold2 algorithm consists of heterogeneous phases -- preprocessing with sparse database searches and model inference with low-arithmetic-intensity attention modules -- that present unique architectural challenges. In this work, we address these bottlenecks by introducing Open-Omics-AlphaFold2, a highly optimized implementation for Intel(R) Xeon(R) CPUs. By leveraging the CPU's versatility in handling both sparse preprocessing algorithms and dense matrix operations via Intel Advanced Matrix Extensions (AMX), we accelerate the entire pipeline end-to-end. Our optimization strategy employs multi-level parallelism -- spanning multiprocessing, multi-threading, and vectorization -- alongside cache-aware tiling and operator fusion. Our results demonstrate that, on a Xeon CPU, Open-Omics-AlphaFold2 achieves a 2-7.58× speedup for preprocessing and a 19.8-29.2× speedup for model inference over baseline DeepMind-AlphaFold2. Moreover, for a proteome of 391 proteins, Open-Omics-AlphaFold2 running on a dual-socket Intel Xeon 6980P system achieves a remarkable 76% higher throughput than the state-of-the-art GPU-accelerated solution, FastFold, running on a single-socket Intel Xeon 6980P CPU with an NVIDIA H100 offload. Code availability: Baremetal: https://github.com/IntelLabs/open-omics-alphafold Containerized: https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/alphafold2-based-protein-folding

20

BraiNN: A Modern Simulator for Clinically Feasible Personalized Whole-Brain Network Modeling

Fasse, A.; Billi, C.; Garvalov, V.; Morvan, M.; Newton, T.; Kuster, N.; Neufeld, E.

2026-07-13 neuroscience 10.64898/2026.07.08.737156 medRxiv

Top 0.2%

3.2%

Personalized whole-brain modeling aims to transform treatment planning for neurological disorders by enabling patient-specific simulations of brain network dynamics. Neural mass models (NMMs) offer a tractable compromise between biophysical detail and computational cost and can be directly linked to macroscopic observables such as EEG. However, scaling NMMs to whole-brain networks with realistic connectivity, conduction delays, and cortical surface resolution--and fitting them to individual patient data--imposes computational demands that existing frameworks cannot meet at clinically relevant timescales. Here we introduce BraiNN, a JAX-based Python framework for large-scale neural mass modeling that achieves speedups of up to two to three orders of magnitude over existing tools by leveraging GPU/TPU-accelerated, XLA-compiled array computation. BraiNN combines a region-level Jansen-Rit network with a subject-specific cortical surface mesh of coupled neural mass models and biophysically grounded EEG forward modeling via reciprocity-based lead fields. Its fully differentiable computational graph enables a hybrid personalization pipeline that pairs Bayesian optimization for global parameter exploration with gradient-based refinement, completing EEG-driven spectral fitting of an eight-dimensional parameter space in approximately 2-3 hours on a single consumer GPU--compared to multiple days with conventional neural mass modeling software. Numerical verification against established benchmarks confirms that BraiNN faithfully reproduces canonical synchronization and bifurcation dynamics of Jansen-Rit networks. By reducing the time requirements for personalizing a high-detail whole-brain surface model from days to a few hours on consumer-grade hardware, BraiNN brings personalized brain network modeling closer to practical use in clinical contexts. We anticipate that BraiNN will serve as a foundation for patient-specific digital twins and EEG-guided neuromodulation planning.
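
For readers unfamiliar with the Jansen-Rit neural mass model named in this abstract, a minimal single-region sketch follows. It is not BraiNN code: it uses plain Python with forward Euler integration rather than JAX, a constant external input, and the commonly cited textbook parameter values. The function names (`sigmoid`, `step`, `simulate`), the step size and the input level `p` are assumptions chosen for illustration.

```python
import math

# Commonly cited Jansen-Rit parameter values (assumed here, not taken from BraiNN).
A, B = 3.25, 22.0      # excitatory / inhibitory synaptic gains (mV)
a, b = 100.0, 50.0     # inverse synaptic time constants (1/s)
C = 135.0              # overall connectivity constant
C1, C2, C3, C4 = C, 0.8 * C, 0.25 * C, 0.25 * C
E0, V0, R = 2.5, 6.0, 0.56  # sigmoid half-maximum rate, threshold, slope


def sigmoid(v):
    """Convert mean membrane potential (mV) to mean firing rate."""
    return 2.0 * E0 / (1.0 + math.exp(R * (V0 - v)))


def step(state, p, dt):
    """Advance the six-variable Jansen-Rit state by one forward Euler step."""
    y0, y1, y2, y3, y4, y5 = state
    dy3 = A * a * sigmoid(y1 - y2) - 2.0 * a * y3 - a * a * y0
    dy4 = A * a * (p + C2 * sigmoid(C1 * y0)) - 2.0 * a * y4 - a * a * y1
    dy5 = B * b * C4 * sigmoid(C3 * y0) - 2.0 * b * y5 - b * b * y2
    return (
        y0 + dt * y3, y1 + dt * y4, y2 + dt * y5,
        y3 + dt * dy3, y4 + dt * dy4, y5 + dt * dy5,
    )


def simulate(n_steps=5000, dt=1e-4, p=120.0):
    """Return the pyramidal-cell potential y1 - y2, a common EEG proxy."""
    state = (0.0,) * 6
    trace = []
    for _ in range(n_steps):
        state = step(state, p, dt)
        trace.append(state[1] - state[2])
    return trace


signal = simulate()  # 0.5 s of simulated activity at dt = 0.1 ms
```

A whole-brain network of the kind the abstract describes would couple many such units through a connectivity matrix with conduction delays. That coupling, the cortical surface mesh and the EEG forward model are outside the scope of this sketch.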